The dataset analyzed in this report is a collection of approximately 5,000 Portuguese white wines called ‘Vinho Verde.’ Each wine has 11 quantitative variables that may or may not affect the variable we care most about, quality. In this analysis I hope to gain some insight into whether and how these variables affect the quality of the wine.
In this first section I look at summary statistics and then examine the distribution of each variable using histograms and boxplots.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Right away I can tell that there are likely outliers in the fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates variables, as their maximum values lie far above the third quartile. Alcohol is a borderline case: its maximum of 14.2 sits about 1.5 IQRs above the third quartile of 11.4. In addition, it’s a small but interesting note that on a quality scale of 0-10 no wine received a score of 0, 1, 2, or 10.
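One common way to make the "far outside the IQR" judgment concrete is Tukey's rule, which flags values above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles and maxima copied from the summary output above. The `stats` data frame and its column names are introduced here only for illustration, and the 1.5 multiplier is a convention, not something this analysis originally specified.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol"),
  q1  = c(6.3, 0.21, 0.27, 1.7, 0.036, 23, 108, 0.9917, 3.09, 0.41, 9.5),
  q3  = c(7.3, 0.32, 0.39, 9.9, 0.050, 46, 167, 0.9961, 3.28, 0.55, 11.4),
  max = c(14.2, 1.10, 1.66, 65.8, 0.346, 289, 440, 1.0390, 3.82, 1.08, 14.2)
)

# Tukey's rule (an assumed convention): values above Q3 + 1.5 * IQR
# are flagged as potential outliers.
stats$iqr         <- stats$q3 - stats$q1
stats$upper_fence <- stats$q3 + 1.5 * stats$iqr
stats$max_flagged <- stats$max > stats$upper_fence

print(stats[, c("variable", "max", "upper_fence", "max_flagged")])
```

On these summary figures, alcohol is the only variable whose maximum stays inside its fence (14.2 against a fence of 14.25). With the full data frame, the same check could be run per observation instead of only on the maximum.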
Looking at these histogram plots, we can see the outliers reflected in the stretched x-axes of the variables where they occur. The bulk of the data falls within a pH range of 2.8 to 3.6. This is surprising to me, as it shows that the range of pH values is relatively small. If there is a correlation between the pH of the wine and its quality, the acidity changes involved are very subtle.
I’m realizing that a lot of this data has a long right tail, which hurts the readability of the histogram plots. This is backed up by the boxplot grid, where several of the variables have many data points far outside the IQR. I’m going to apply some transformations in order to clean up these distributions. First I will generate skewness values using the moments package. From what I understand, values within [-1, 1] indicate at most moderate skew, and a perfectly symmetric distribution, such as the normal, has a skewness of 0.
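To show what a skewness value measures, here is a minimal base R sketch of the moment-based coefficient: the third central moment divided by the second central moment raised to the power 3/2. The `skew` helper is a hypothetical stand-in written for this example. To my understanding the moments package's `skewness()` uses the same definition, but that is an assumption worth checking against its documentation. The data are simulated, not the wine data.

```r
# Moment-based sample skewness: m3 / m2^(3/2), using central moments.
skew <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

set.seed(42)
symmetric    <- rnorm(5000)  # symmetric: skewness should be near 0
right_skewed <- rexp(5000)   # long right tail: theoretical skewness is 2

skew(symmetric)
skew(right_skewed)
```

With the wine data loaded as `df`, something like `sapply(df, skew)` would produce one value per column, which is the shape of the tables that follow.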
Skew values under no transformation
## X fixed.acidity volatile.acidity
## 0.0000000 0.6475531 1.5764965
## citric.acid residual.sugar chlorides
## 1.2815278 1.0767639 5.0217922
## free.sulfur.dioxide total.sulfur.dioxide density
## 1.4063141 0.3905902 0.9774735
## pH sulphates alcohol
## 0.4576423 0.9768944 0.4871927
## quality
## 0.1557487
It appears that volatile acidity, citric acid, residual sugar, chlorides, and free sulfur dioxide are all skewed to the right, as indicated by skewness values larger than 1. Logarithmic transformations may be necessary for multivariate analyses.
Skew values under logarithmic transformation
## X fixed.acidity volatile.acidity
## -1.94372852 0.07682765 0.13934046
## citric.acid residual.sugar chlorides
## NaN -0.16110754 1.13378629
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.93603533 -0.98391453 0.93065675
## pH sulphates alcohol
## 0.29873113 0.23368576 0.31003964
## quality
## -0.40764675
We can now evaluate the distribution of the variables under a logarithmic transformation. Volatile acidity and residual sugar are much closer to symmetric under this transformation (skewness of 0.14 and -0.16), and chlorides improves substantially (from 5.02 to 1.13), although it remains right-skewed.
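The comparison between transformations can be sketched on simulated data. The example below draws a right-skewed, strictly positive sample (a log-normal, chosen only for illustration) and compares the skewness of the raw values with the skewness after square root and logarithmic transformations. The `skew` helper is the same hypothetical moment-based function as before, redefined so the block runs on its own.

```r
# Moment-based skewness helper (hypothetical, defined for this sketch).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Synthetic stand-in for a right-skewed, strictly positive measurement.
x <- rlnorm(5000, meanlog = 0, sdlog = 0.75)

# Skewness before and after each transformation.
round(c(raw = skew(x), sqrt = skew(sqrt(x)), log = skew(log(x))), 3)
```

For this kind of data the square root reduces the skew partway and the logarithm brings it close to zero, which is the pattern seen for volatile acidity and residual sugar in the tables. Real variables will not always behave this cleanly.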
Skew values under square root transformation
## X fixed.acidity volatile.acidity
## -0.56494566 0.35117701 0.78807976
## citric.acid residual.sugar chlorides
## -0.42671390 0.31610663 2.84993244
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.04967674 -0.16387369 0.95371995
## pH sulphates alcohol
## 0.37748822 0.59208653 0.39776908
## quality
## -0.10970400
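For citric acid the better option appears to be the square root transformation. The logarithm returns NaN because some wines have a citric acid value of 0 (the minimum in the summary above), and log(0) is not finite. Under the square root, the skewness drops from 1.28 to -0.43.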
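The NaN in the logarithmic skew table for citric acid can be reproduced with a handful of toy values that include an exact zero. This is a sketch with made-up numbers in the same range as the real column. The `skew` helper is hypothetical, and `log1p()` is shown only as a possible alternative that this analysis did not use.

```r
# Moment-based skewness helper (hypothetical, defined for this sketch).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy values including an exact zero, as citric.acid does (its minimum is 0).
citric <- c(0, 0.16, 0.27, 0.32, 0.34, 0.36, 0.40, 0.43, 1.66)

log(citric)[1]       # -Inf: the log of zero is not finite
skew(log(citric))    # NaN: the -Inf propagates through the mean
skew(sqrt(citric))   # finite, because sqrt(0) is simply 0
skew(log1p(citric))  # finite alternative: log(1 + x)
```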
Volatile acidity looks much more normal under the logarithmic transformation.
Residual sugar has a bimodal distribution under the logarithmic transformation, with peaks around 0.5 and 2.25. Interesting. I wonder why there is a dip around log(residual.sugar) = 1.1.
The remaining two transformed distributions look good.
This dataset contains 4,898 different wines with 1 id variable and 12 quantitative factors (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).
The most important variable here is quality as we are attempting to determine how the other 11 variables influence quality. I would like to look into how the different acidity variables interact/influence each other.
I’m also curious whether the sweet/salty (residual sugar/chlorides) levels of the wines are at all related to the pH values, as I would suspect that the more acidic a wine is, the sweeter it would have to be to counteract that acidity.
In the text file explaining the dataset, it says that total sulfur dioxide is undetectable under 50 ppm and at higher levels will affect the nose and the taste of the wine. However, it doesn’t mention whether this affects the wine positively or negatively. This is certainly something we can look into, as the vast majority of data points have total sulfur dioxide over 50 ppm (the first quartile is already 108).
I have not created any new variables yet, although I may introduce new variables to categorize quality, pH, or sulfur dioxide if I feel that they will aid the analysis or make it more digestible.
Several of the variables in this dataset had significant skew. Specifically, I transformed the distributions of volatile acidity, residual sugar, chlorides, and citric acid. The majority of the univariate section is spent finding appropriate transformations to reduce the skew so that the statistical analysis and modeling adhere more closely to the assumptions of inferential statistics.
Looking at the results of our ggpairs utility, I can see that the strongest direct correlation with quality occurs with alcohol content: a moderate positive correlation of ~.44. This makes me want to look at the alcoholic strength of wine vs. the quality. Further, I can see that there is a weak to moderate negative correlation between quality and volatile acidity, chlorides, free sulfur dioxide, and density. It’s certainly worth examining these relationships to see why this may be the case.
Some other noteworthy observations include a strong correlation (~.84) between density and residual sugar, and strong correlation values for residual sugar and free/total sulfur dioxide. I suspect that these variables may be important should I attempt to generate a model in the future.
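The coefficients reported in a ggpairs grid are Pearson correlations, which base R can also produce directly with `cor()`. The sketch below uses simulated columns (not the wine data) in which one variable strongly drives another, loosely mimicking the residual sugar and density relationship. All names and numbers in it are made up for illustration.

```r
set.seed(7)
n <- 500
# Synthetic columns, NOT the wine data: 'sugar' drives 'density' strongly,
# while 'noise' is unrelated to either.
sugar   <- runif(n, 1, 20)
density <- 0.99 + 0.0004 * sugar + rnorm(n, sd = 0.001)
noise   <- rnorm(n)
toy     <- data.frame(sugar, density, noise)

# Pearson correlation matrix, rounded for readability.
round(cor(toy, method = "pearson"), 2)

# cor.test() adds a confidence interval and p-value for a single pair.
cor.test(toy$sugar, toy$density)$estimate
```

A numeric matrix like this is easier to scan for the strongest pairs than a plot grid, although it hides the shape of each relationship that the scatterplots reveal.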
First I’d like to examine how the various acidity variables affect pH. I want to know if it is possible for there to be changes in fixed/volatile acidity and citric acid without affecting the overall pH drastically.
According to these graphs, the pH is affected most by changes in fixed acidity. There is little to no correlation between the levels of volatile acidity or citric acid and pH.
Zooming out a little bit, I have made a scatter plot of pH and quality and a line plot showing mean quality for all pH levels. The scatterplot reinforces that the majority of the wines lie between pH values of 2.9 and 3.6. In the mean quality graph, we can see a positive trend that indicates a moderate correlation between lower acidity and higher quality. Towards the ends of the plot (< 2.9 and > 3.5) there is a significant amount of noise. I believe this can be accounted for by the relatively small number of data points in these ranges. For this reason, I have zoomed in on the quality plot, where we can see this positive correlation a little more clearly. It seems that there is a “sweet spot” between 2.9 and 3.6 where you want the pH value of your wine to lie such that it is not too acidic or too basic.
Here we have examined the mean quality of the wine vs. the alcohol content. The mean quality is drawn in black, the median in orange, and the linear model in blue. We can see the moderate correlation. It is amusing to note that, as you might expect, the stronger a wine is, the better it is reviewed. Unfortunately, this graph is pretty noisy.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Here we can see the strong positive correlation between residual sugar and density. The more residual sugar, the sweeter and more dense the wine becomes. Interestingly, there is a moderate negative correlation between the density and the quality of the wine. Painting a bigger picture, one may assume that past a certain level of sweetness the quality may begin to drop. Let’s take a look at residual sugar and mean quality.
Here the data bears out my suspicion. I’ve used a rounding method to reduce the noise in the original plot. We can see the trend in residual sugar negatively affecting the quality. Now let’s look at similar graphs for density.
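The rounding method mentioned above can be sketched as follows: round the x variable to a coarser grid, then average quality within each rounded value. The data here are simulated with a weak negative trend. The `toy` data frame, its column names, and the choice of rounding to whole units are assumptions made for the example, not the exact settings used for the plots.

```r
set.seed(3)
n <- 2000
# Synthetic stand-in data with a weak negative trend plus noise.
toy <- data.frame(sugar = round(runif(n, 0.6, 20), 2))
toy$quality <- round(6.2 - 0.03 * toy$sugar + rnorm(n, sd = 0.8))

# Mean quality at every distinct sugar value: many tiny, noisy groups.
raw_means <- aggregate(quality ~ sugar, data = toy, FUN = mean)

# Round sugar to the nearest whole unit first: fewer, larger groups.
toy$sugar_bin <- round(toy$sugar)
bin_means <- aggregate(quality ~ sugar_bin, data = toy, FUN = mean)

nrow(raw_means)  # number of groups before rounding
nrow(bin_means)  # number of groups after rounding
```

Each binned mean averages over many more wines, so the line through `bin_means` is much smoother. The cost is that any structure finer than the bin width is lost.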
The slightly stronger negative correlation is on display here as well for density. Finally, I’d like to look at total sulfur dioxide past 50ppm and how it affects the quality of a wine.
It appears that there is only a very small negative correlation here. Going back to the ggpairs graph, we can see that the value of the correlation is only about -.18 so this makes sense.
This section was somewhat disappointing in that it did not turn up many strong relationships with quality. We are able to note a moderate positive correlation between alcohol and quality and weak to moderate negative correlations between quality and density, residual sugar, and total sulfur dioxide.
After examining the various acidity-related variables, we found, surprisingly, that citric acid and volatile acidity have little if any impact on the overall pH of the wine. However, there is a moderate correlation between the fixed acidity and the pH of the wine.
The strongest relationship that was found is between residual sugar and density. This is logical, as the mean density of wine is approximately that of water, around 1 g/cm^3, whereas the density of sugar is ~1.6 g/cm^3. As the sugar content of wine increases, the density increases as well.
These graphs are identical to those in the bivariate plots section except that I’ve colored the points according to a gradient based on quality. We can see that the colors all blend towards the middle range of the gradient in all areas of the graphs. This tells me that the quality is not drastically affected by these 3 variables. If it were, you would expect to see a concentration of higher or lower quality points in particular regions.
In contrast to the previous section, where we saw no quality grouping based on the pH or any of the acidity variables, we can see a clear grouping of higher quality wines underneath the plotted trendline. We can also see a clear grouping of lower quality wines above the trendline. This corresponds neatly to the Pearson coefficients generated by our ggpairs plot from earlier.
In this next section, I will build a linear model for the quality of wine. Here we must make sure to use the transformations from earlier so that the model can be constructed sticking as closely to the assumptions of inferential statistics as possible.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = df)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = df)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides),
## data = df)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity), data = df)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide, data = df)
## m6: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity,
## data = df)
## m7: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity +
## pH, data = df)
## m8: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity +
## pH + log(residual.sugar), data = df)
## m9: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity +
## pH + log(residual.sugar) + sulphates, data = df)
## m10: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity +
## pH + log(residual.sugar) + sulphates + sqrt(citric.acid),
## data = df)
## m11: lm(formula = I(quality) ~ I(alcohol) + density + log(chlorides) +
## log(volatile.acidity) + total.sulfur.dioxide + fixed.acidity +
## pH + log(residual.sugar) + sulphates + sqrt(citric.acid) +
## free.sulfur.dioxide, data = df)
##
## ================================================================================================================================================================================
##                                m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                 2.582***    -22.492***    -23.060***    -38.416***    -32.004***    -44.664***    -45.179***     45.155***     55.177***     56.816***     52.219***
##                            (0.098)       (6.165)       (6.151)       (5.999)       (6.266)       (6.464)       (6.484)      (11.218)      (11.369)      (11.396)      (11.408)
## I(alcohol)                  0.313***      0.360***      0.334***      0.380***      0.383***      0.399***      0.401***      0.301***      0.286***      0.282***      0.284***
##                            (0.009)       (0.015)       (0.016)       (0.015)       (0.015)       (0.015)       (0.016)       (0.018)       (0.019)       (0.019)       (0.019)
## density                                  24.728***     24.954***     39.264***     32.577***     45.834***     46.631***    -45.114***    -55.241***    -56.965***    -52.371***
##                                          (6.079)       (6.065)       (5.909)       (6.204)       (6.427)       (6.474)      (11.331)      (11.483)      (11.513)      (11.524)
## log(chlorides)                                         -0.196***     -0.140***     -0.153***     -0.153***     -0.154***     -0.097*       -0.099**      -0.105**      -0.102**
##                                                        (0.040)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)       (0.038)
## log(volatile.acidity)                                                -0.615***     -0.630***     -0.642***     -0.645***     -0.658***     -0.649***     -0.634***     -0.599***
##                                                                      (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.034)       (0.034)
## total.sulfur.dioxide                                                                0.001***      0.001**       0.001**       0.001*        0.000         0.000        -0.001
##                                                                                    (0.000)       (0.000)       (0.000)       (0.000)       (0.000)       (0.000)       (0.000)
## fixed.acidity                                                                                    -0.100***     -0.107***     -0.027        -0.021        -0.027        -0.019
##                                                                                                  (0.014)       (0.015)       (0.017)       (0.017)       (0.017)       (0.017)
## pH                                                                                                             -0.083         0.299***      0.274**       0.291**       0.303***
##                                                                                                                (0.081)       (0.089)       (0.089)       (0.090)       (0.089)
## log(residual.sugar)                                                                                                           0.238***      0.259***      0.261***      0.246***
##                                                                                                                              (0.024)       (0.025)       (0.025)       (0.025)
## sulphates                                                                                                                                   0.492***      0.483***      0.491***
##                                                                                                                                            (0.098)       (0.098)       (0.098)
## sqrt(citric.acid)                                                                                                                                         0.215*        0.200
##                                                                                                                                                          (0.109)       (0.109)
## free.sulfur.dioxide                                                                                                                                                     0.004***
##                                                                                                                                                                        (0.001)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                   0.190         0.192         0.196         0.250         0.252         0.260         0.260         0.275         0.278         0.279         0.283
## adj. R-squared              0.190         0.192         0.196         0.250         0.251         0.259         0.259         0.274         0.277         0.278         0.281
## sigma                       0.797         0.796         0.794         0.767         0.766         0.762         0.762         0.755         0.753         0.753         0.751
## F                        1146.395       583.290       398.940       408.099       329.683       286.787       245.969       231.479       209.545       189.087       174.909
## p                           0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000
## Log-likelihood          -5839.391     -5831.127     -5818.842     -5649.559     -5643.429     -5616.372     -5615.848     -5568.008     -5555.523     -5553.586     -5541.504
## Deviance                 3112.257      3101.773      3086.252      2880.126      2872.926      2841.359      2840.751      2785.798      2771.632      2769.441      2755.811
## AIC                     11684.782     11670.255     11647.685     11311.119     11300.858     11248.743     11249.695     11156.017     11133.046     11131.172     11109.008
## BIC                     11704.272     11696.241     11680.167     11350.098     11346.334     11300.716     11308.164     11220.983     11204.508     11209.131     11193.463
## N                            4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898
## ================================================================================================================================================================================
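Unfortunately, looking at the R-squared values generated from our models, we can see that the fit is not very good: even the largest model, m11, reaches an R-squared of only 0.283, so it explains less than 30% of the variance in quality.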
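The way the models above are built up, one term at a time, can be sketched with `lm()` and `update()` on simulated data. Everything in the `toy` data frame is made up for illustration. The point is only the mechanics of adding a transformed term and comparing fit statistics, not a reproduction of m1 through m11.

```r
set.seed(11)
n <- 1000
# Synthetic predictors and response, used only to show the mechanics.
toy <- data.frame(alcohol = rnorm(n, 10.5, 1.2),
                  acidity = rlnorm(n, log(0.27), 0.35))
toy$quality <- 3 + 0.3 * toy$alcohol - 0.6 * log(toy$acidity) +
  rnorm(n, sd = 0.75)

m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + log(acidity))  # add one transformed term

# R-squared never decreases when a term is added, so also compare
# adjusted R-squared and AIC, which penalise extra parameters.
data.frame(model  = c("m1", "m2"),
           r2     = c(summary(m1)$r.squared, summary(m2)$r.squared),
           adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
           aic    = c(AIC(m1), AIC(m2)))

anova(m1, m2)  # F-test for whether the added term improves the fit
```

The same caution applies to the table above: R-squared creeps up with every added variable, but AIC actually rises from m6 to m7 (11248.743 to 11249.695) when pH is added, which suggests that term contributed little at that step.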
We found that the fixed acidity, volatile acidity, and citric acid levels of the wines seem to have a relatively small impact on the quality of the wine.
In addition, we found that white wines that were less sweet were in fact more highly rated by the reviewers that participated in generating this data. We also illustrated and explained why and how an increase in residual sugar is strongly associated with an increase in density. This was the strongest relationship that was uncovered in this analysis.
I was surprised to see that there was little to no effect of citric acid on the quality of wines. I would expect with a white wine that the refreshing, fruity flavors of citric acid would be highly desirable but apparently that is not the case.
I did attempt to create a linear model for the quality of the wine. The strength of this model is that it incorporates all of the variables that were provided in the dataset. The primary weakness of the model is its poor fit. With all of the variables included, the best R-squared value that was achieved was .283, meaning the model explains only about 28% of the variance in quality. This is far too low for a model that you would hope to use for prediction. Because the variables in this data set are varied and nuanced, it is perhaps not surprising that a linear approach was not very fruitful. This is mentioned in the informational text included with the dataset. A more sophisticated model beyond my current level of analysis would be required to generate good predictions for the quality of white wines.
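As a quick sanity check on the reported figures, the adjusted R-squared for m11 can be recomputed from its R-squared, the number of observations, and the number of predictors, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The inputs below are read from the m11 column of the model table. The variable names are introduced just for this sketch.

```r
# Values read from the m11 column of the model table.
r2 <- 0.283  # R-squared
n  <- 4898   # observations
p  <- 11     # predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)  # 0.281, matching the reported adjusted R-squared
```

With nearly 5,000 observations and only 11 predictors the penalty is tiny, so the low R-squared reflects a genuinely weak fit and is not a side effect of overfitting.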
This plot is important as it shows the strongest relationship to quality that we uncovered in this analysis. Ironically, it is one you might have guessed before doing any type of in depth EDA. The stronger the wine, the more the reviewers seemed to like it. Although this is not a very nuanced insight, it is the strongest that I found. We can also see how rounding a factor can reduce the noise in a graph at the sacrifice of information. This rounding factor, however, makes the correlation much more clear to the eye.
These graphs show a few things. First, we can see what we originally intended to discover with these plots: the pH appears to be independent of the volatile acidity and citric acid variables, while the fixed acidity does appear to have a large impact on the pH level of the wine. In addition, we can see from the color gradient that there isn’t a concentration of high or low quality wines in any region of the graph. This would imply that these variables do not have a significant impact on the quality variable. This is borne out by the coefficients we generated in our ggpairs plot, which show very weak correlations between these variables and quality. The strongest correlation was fixed acidity with a coefficient of ~ -.2.
Finally, we see here the strongest relationship that was uncovered by our analysis. As you increase the level of residual sugar, the density of the wine increases on a very strong linear relationship. This makes complete sense as sugar is more dense than wine. Thus, the more that you add the more dense the wine becomes. The residual sugar also reflects how sweet the wine is. It appears that higher quality white wines in general are less sweet and therefore less dense as well. This is also reflected in the graph by the color gradient as you can see the high quality grouping below the trendline and the low quality grouping above the trendline.
I feel that this dataset was challenging in that I knew very little about the variables included with the dataset before beginning this project. Although I was able to gain a comfortable grasp of them as time progressed, the challenge was in figuring out how they relate to each other and how to analyze them. This, as with any data analysis, is the essential challenge of EDA. Additionally, the fact that all of the variables were quantitative made it difficult to facet or group graphs in any particular way. If I were to do the analysis again, I might group wines on variables like total sulfur dioxide and pH according to whether they fall within the sweet spot of not being offensive. For example, wines within 2.9-3.5 pH could be labeled ‘normal,’ with the rest labeled ‘too acidic’ or ‘too basic.’ I was also surprised to see that there were really no strong correlations between any of these variables and the quality of the wine. Finally, I was very optimistic when constructing my model that I might be able to generate some sort of predictive ability about the quality of wines. This turned out not to be the case, as the model I made is about as primitive as it gets. I also feel as though I was a little repetitive in the types of plots that I generated, but I struggled to think of new ways to illustrate the data that would be insightful.
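The grouping idea above could be sketched with base R's `cut()`, which turns a numeric variable into a labeled factor. The pH values below are made-up examples, and the break points of 2.9 and 3.5 simply follow the range suggested in the text. This is one possible encoding and was not part of the analysis itself.

```r
ph <- c(2.72, 2.95, 3.18, 3.50, 3.62, 3.82)  # example pH values

# Intervals are right-closed by default: (-Inf, 2.9], (2.9, 3.5], (3.5, Inf].
ph_group <- cut(ph,
                breaks = c(-Inf, 2.9, 3.5, Inf),
                labels = c("too acidic", "normal", "too basic"))

table(ph_group)
```

A factor like `ph_group` could then be used for faceting or coloring plots, which would address the difficulty of grouping graphs when every variable is quantitative.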